Adapting SVM for data sparseness and imbalance: a case study in information extraction

نویسندگان

  • Yaoyong Li
  • Kalina Bontcheva
  • Hamish Cunningham
چکیده

Support Vector Machines (SVM) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVM more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark datasets used for evaluation and comparison of machine learning algorithms for IE. In addition, we also evaluate the token based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Remote Sensing and Land Use Extraction for Kernel Functions Analysis by Support Vector Machines with ASTER Multispectral Imagery

Land use is being considered as an element in determining land change studies, environmental planning and natural resource applications. The Earth’s surface Study by remote sensing has many benefits such as, continuous acquisition of data, broad regional coverage, cost effective data, map accurate data, and large archives of historical data. To study land use / cover, remote sensing as an effic...

متن کامل

Mapping the Potential of Groundwater Resources in Hard Formations Using Geographic Information System and Remote Sensing, Case Study: Northwest of Shahroud

In recent years, rapid population growth has led to increase per capita water use in various sectors including agriculture and industry and a growing gap between water demand and water supply has emerged. Therefore, identifying and tracking changes in groundwater resources as an alternative and reliable source of surface water resources are so important to region located in the Middle East with...

متن کامل

Developing a New Method in Object Based Classification to Updating Large Scale Maps with Emphasis on Building Feature

According to the cities expansion, updating urban maps for urban planning is important and its effectiveness is depend on the information extraction / change detection accuracy. Information extraction methods are divided into two groups, including Pixel-Based (PB) and Object-Based (OB). OB analysis has overcome the limitations of PB analysis (producing salt-pepper results and features with hole...

متن کامل

A prediction distribution of atmospheric pollutants using support vector machines, discriminant analysis and mapping tools (Case study: Tunisia)

Monitoring and controlling air quality parameters form an important subject of atmospheric and environmental research today due to the health impacts caused by the different pollutants present in the urban areas. The support vector machine (SVM), as a supervised learning analysis method, is considered an effective statistical tool for the prediction and analysis of air quality. The work present...

متن کامل

A prediction distribution of atmospheric pollutants using support vector machines, discriminant analysis and mapping tools (Case study: Tunisia)

Monitoring and controlling air quality parameters form an important subject of atmospheric and environmental research today due to the health impacts caused by the different pollutants present in the urban areas. The support vector machine (SVM), as a supervised learning analysis method, is considered an effective statistical tool for the prediction and analysis of air quality. The work present...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Natural Language Engineering

دوره 15  شماره 

صفحات  -

تاریخ انتشار 2009